Typeset with jpsj2.cls <ver.l.2> Full Paper Statistical Mechanics of Linear and Nonlinear Time-Domain Ensemble Learning Seiji MIYOSHI 1 * and Masato OKADA 234 1 Department of Electronic Engineering, Kobe City College of Technology, 8-3 Gakuenhigashimachi, Nishi-ku, Kobe-shi, 651-2194 2 Division of Transdisciplinary Sciences, Graduate School of Frontier Sciences, The University of Tokyo, 5-1-5 Kashiwanoha, Kashiwa-shi, Chiba, 277-8561 3 RIKEN Brain Science Institute, 2-1 Hirosawa, Wako-shi, Saitama, 351-0198 Conventional ensemble learning combines students in the space domain. In this paper, how- ever, we combine students in the time domain and call it time-domain ensemble learning. We analyze, compare, and discuss the generalization performances regarding time-domain ensemble learning of both a linear model and a nonlinear model. Analyzing in the framework of online learning using a statistical mechanical method, we show the qualitatively different behaviors between the two models. In a linear model, the dynamical behaviors of the gener- alization error are monotonic. We analytically show that time-domain ensemble learning is twice as effective as conventional ensemble learning. Furthermore, the generalization error of a nonlinear model features nonmonotonic dynamical behaviors when the learning rate is small. We numerically show that the generalization performance can be improved remarkably by using this phenomenon and the divergence of students in the time domain. KEYWORDS: ensemble learning, online learning, generalization error, statistical mechanics 1. Introduction Learning can be roughly classified into batch learning and online learning. 1 In batch learning, given examples are used more than once. In this paradigm, a student will give the correct answers after training if that student has an adequate degree of freedom. However, it is necessary to have plenty of time and a large memory for storing many examples. On the contrary, in online learning examples used once are then discarded. In this case, a student cannot give correct answers for all the examples used in training. There are, however, merits. For example, a large memory for storing many examples is not necessary, and it is possible to follow a time-variant teacher. Recently, we analyzed the generalization performance of some models in a framework of online learning using a statistical mechanical method. 2-6 Ensemble learning means to combine many rules or learning machines (called students in this paper) that perform poorly; this has recently attracted the attention of many researchers. 2, 3,7-10 The diversity or variety of *E-mail address: miyoshi@kobe-kosen.ac.jp 1/11 J. Phys. Soc. Jpn. Full Paper students is essential in ensemble learning. We showed that the three well-known rules, Hebbian learning, perceptron learning, and AdaTron learning have different characteristics in their affinities for ensemble learning, that is in "maintaining diversity among students" 3 D In that process, 13,14 it was subsidiarily proven that in an unlearnable case, 11 ' 12 the student vector does not converge in one direction but continues moving. Therefore, we also analyzed the generalization performance of a student supervised by a moving teacher that goes around a true teacher, 4 proving that the generalization error of a student can be smaller than a moving teacher's, even if the student only uses examples from the moving teacher. In an actual human society, a teacher observed by a student does not always present the correct answer. In fact, many cases the teacher is learning and continues to change. Therefore, analyzing such a model is interesting in terms of considering the analogies between statistical learning theories and real human society. In conventional ensemble learning, generalization performance is improved by combining students who have diversities. However, students do not always converge in one direction but may continue moving in an unlearnable model. Therefore, generalization performance in such a model must be improved by combining students themselves at different times, even if there is only one student. Conventional ensemble learning combines students in the space domain. In contrast, we introduce a method of combining the students in the time domain, which we call "time-domain ensemble learning". 6 Some studies 15-18 have treated the combining of students in the time domain. We partic- ularly pay attention to dynamical behaviors of the generalization performance of the time- domain ensemble learning and theoretically analyze it by applying a statistical mechanical method. We analytically or numerically obtain, compare, and discuss the order parameters and the generalization errors of two models: a linear model in which both teacher and student are linear perceptrons 2 with noise and a nonlinear model in which both teacher and student are nonlinear perceptrons. The results show that the two models have the qualitatively different behaviors. We analytically demonstrate that time-domain ensemble learning of a linear model is twice as effective as conventional ensemble learning. Furthermore, we numerically show that the generalization performance of a nonlinear model can be ramarkably improved by using nonmonotonic dynamical behaviors and the divergence of students in the time domain. 2. Model In this paper we consider a teacher and a student. They are perceptrons with the con- nection weights B and J m , respectively, where m denotes the time step. For simplicity, the connection weights of the teacher and the student are simply called the teacher and the stu- dent. Teacher B = (B lt B N ), student J m = (Jf , . . . , J%), and input x m = (xf, ...,x%) are iV-dimensional vectors. Each component B{ of B is independently drawn from Af(0, 1) and fixed, where Af(0, 1) denotes a Gaussian distribution with a mean of zero and a variance of 2/11 J. Phys. Soc. Jpn. Full Paper unity. Each component jf of the initial value J° of J m is independently drawn from JV(0, 1). The direction cosine between J m and B is R m and that between J m and J m is q m ' m . Each component xf 1 of x m is drawn from JV(0, 1/iV) independently. Figure 1 illustrates the relationship among teacher _B, students J m and J m and the direction cosines R m ,R m ' , and q m ' m ' . B / / i Fig. 1. Teacher .B and students J m and J m . R m ,R m , and g m > m are direction cosines. In this paper, we also deal with the thermodynamic limit N —* oo. Therefore, \\B\\ = y/N, \\J°\\ = y/N, and \\x m \\ = 1. Generally, since the norm II J m \\ of the student changes as the time step proceeds, the ratios l m of the norm to \HV are introduced and are called the length of the student. That is, || J m \\ = l m ^. Linear Case Outputs of the teacher and the student are = v' m +n 7 ^ and of = u m l m +n'j l , respectively. Here, v m = B-x m ,u m l m = J m -x m ,n™ ~7^(0,a|) ,nj ~ M (0, aj) , where AA(0, a 2 ) denotes a Gaussian distribution with a mean of zero and a variance of a 2 . That is, the outputs of the teacher and the student include independent Gaussian noises with variances of a 2 B and a 2 , respectively. Then, v m and u m obey Gaussian distributions with a mean of zero, a variance of unity, and a covariance of R m . Let us define the error between the teacher B and the student J m alone by the squared error of their outputs: ^ = \ (°S - oj) 2 . (1) Student J rn adopts the gradient method as a learning rule and uses input x and an output of teacher B for updates. That is, am J m+1 = jm -VQjk (2) + 7](y + n B — u I —nj)x , (S) 3/11 J. Phys. Soc. Jpn. Full Paper where n denotes the learning rate of the student and is a constant positive number. The part rj {v m + n B — u m l m — n™) has been determined by the learning rule. Generalizing this part, we denote it with f m . Nonlinear Case The teacher and the student are a nonmonotonic perceptron and a sim- ple perceptron, respectively; their outputs are o B = sgn((f m — a)v m (v m + a)) , o™ = sgn(u m l m ). Here, v m = B ■ x m and u m l m = J m ■ x m , v m and u m obey Gaussian dis- tributions with a mean of zero, a variance of unity, and a covariance of R m , and sgn(-) denotes a sign function. Student J m adopts the perceptron learning as a learning rule for updates. That is, jm+l = j™ + r? e(-0^ y) ^ m , (4) where n denotes the learning rate of the student and is a constant positive number, ©■(•) denotes a step function. The part ri<d (— o^Oj 1 ) has been determined by the learning rule. Generalizing this part, we denote it with f m . 3. Theory 3.1 Generalization error Conventionally, ensemble learning means to improve performance by combining many students that perform poorly. We, however, use just one student and combine copies of it (hereafter called "brothers") at different time steps in this paper. Conventional ensemble learning combines students in the space domain, whereas, we do so in the time domain. In this paper K brothers J mi , J" 12 , . . . , J niK are combined. Here, m\ < m-i < . . . < mx- One goal of statistical learning theory is to theoretically obtain generalization errors. Since a generalization error is the mean of errors over the distribution of the new input x , the generalization errors e 9 of the ensemble in linear and nonlinear cases are calculated as follows: Linear Case We use the squared error e for new input x. Here, it is assumed that the Gaussian noises are independently added to the teacher and each brother of the ensemble. The weight of each brother J nik of the ensemble satisfies Ck > 0. That is, the error of the ensemble is \(^-x + n B -Y^Ck{J m *-x + n k )^ , (5) where ree ~ M(0,a B ) and nk ~ A/"(0, o~j). Thus, the generalization error e 9 of the ensemble is calculated as follows: e 9 = J dxdriB ^ JJ dn^ p(x)p(n B ) ^ JJ p{n k )^ e (6) = J dv ( f[ du k j dn B ( f[ dn k j p(v, {u k })p{n B ) ( f[ p(nk) J \k=l / \fe=l / \fc=l 4/11 J. Phys. Soc. Jpn. Full Paper x^u + n B -5^C fc (u fc Z m *+n fc )j (7) / K K K = - i -2j2c k r k R mk + zJ2Yl c k c k d m n m « q m ^ m y V k=l k=lk'>k + EQ 2 (^) 2 + 4 + EQM), (8) k=l k=l / where v = B x and u k l mk = J nLk ■ x. We executed integration using the following: v and u k obey A/"(0, 1). The covariance between v and u k is i? mfc , and that between u k and u^' i S q m k,m k , _ Nonlinear Case A majority vote by brothers might be a typical method of combining for a nonlinear model in which the output of each student is +1 or —1. However, to simplify the analysis we apply the following method: the output of the ensemble is that of a new perceptron of which the connection weight is the weighted sum of the normalized connection weights J tk /l tk of brothers. That is, e = © ^- (v - a) v (v + a) ^ C k Ukj , (9) where C k > is a weight of each brother J mk in the ensemble. Thus, the generalization error e g of the ensemble is expressed as follows: dxp{x)e = J dv du k ^j p(v, {u k })Q ^— (v — a) v (v + a)J2C k u k ^j . (10) 3.2 Differential equations for order parameters, and their solutions In this paper, we examine the thermodynamic limit N — ► oo. To do so, updates of Eq.(3) or Eq.(4) must be executed O(N) times for the order parameters l,R, and q to change by 0(1). Thus, the continuous times t±, . . . , tx, which are the time steps mi, . . . , mx normalized by the dimension TV, are introduced as the superscripts that represent the learning process. To simplify the analysis, we introduced the following auxiliary order parameters r* = and Q*'*' = l t l t 'q t ' t ' . The simultaneous differential equations in deterministic forms, 19 which describe the dynamical behaviors of order parameters, have been obtained based on the self- averaging of thermodynamic limits as follows: ^ ~ </'«'> + £ - (/'»'), (12) ^ = !'(/''6'>, (13) 5/11 J. Phys. Soc. Jpn. Full Paper where t' > t and u f = x 1 ' ■ J f //* ~ AA(0, 1). Four sample averages in Eqs.(ll)-(13) are obtained by executing integrations where i>* , u* and u* obey the triple-Gaussian distribution p(v 1 ' , u 1 ' , u l ), for which the covariance matrix is / 1 R f ' R f \ S = R? 1 g*.** . (14) \ i?* g*'*' 1 j Linear Case The four sample averages can be easily calculated as follows: </V> = r?(r*/Z* - /*), (15) <(/<) 2 > = v \l + a% + a 2 J + (l t ) 2 -2r t ), (16) (/V) = tfl-r'), (17) (/*'«*) = r? (r* - Q*'*') /I*. (18) Using R° = 0,1° = 1 and Q*'* = (Z*) 2 as initial conditions, we can analytically solve the simultaneous differential equations Eqs.(ll)-(13) as follows: 6 r t = 1 _ e -r,t^ (19) {?? = l + ^K + ^)-2e^+(2-^K + a 2 ))e^, (20) Q*'*' = 1 - e"* + e-"*' + ((/*) 2 - 1) e"^*'-*) . (21) Substituting Eqs.(19)-(21) into Eq.(8), the generalization error e g can be analytically obtained as a function of time t^, k = 1, . . . , K. Nonlinear Case The four sample averages are obtained as follows: ^-TsH'^K)- 1 )- 1 )- <22) (l/'l-; 2„M /°°d„h( , R '° U/Vfff- , R '" 1 I. (2:1] a 1-iR 1 ) 2 Jo \ Jl-(R^ 2 </V) = -^(2exp(-^)-l-i?M. (2 1) ^7T V V 2 </*V) = - _ 1 - </'V) + (rf - g*'^) </V>) , (25) where H{u) = Dx, Dx = -^== exp (—^Pj • The generalization error e g is obtained by solving Eqs.(10)-(13), and (22)-(25) numerically. 4. Results and Discussion Figure 1 shows examples of the dynamical behaviors of I and R in a linear model obtained analytically, and the corresponding simulation results, where N = 2,000. In the case of a 6/11 J. Phys. Soc. Jpn. Full Paper linear model, many properties regarding both dynamical behaviors and steady states can be analytically proven. 6 For example, I and e g diverge unless < rj < 2. In the case of no noise, I asymptotically approaches unity after becoming larger than unity if < rj < 1 and I asymptotically approaches unity after becoming smaller than unity if 1 < rj < 2. The larger rj is, the faster R rises. However, the convergence of R is fastest when the learning rate satisfies rj = 1. This phenomenon can be understood by the fact that rj = 1 is a special condition for the gradient method where the student uses up the information obtained from input x. 5 Figure 2 displays some examples of the dynamical behaviors of I and R in a nonlinear model obtained numerically, and the corresponding simulation results, where N = 2, 000. The reason why R is negative, which differs from the linear case, is that the threshold a of a nonmonotonic teacher is greater than the critical value ac = \/2 In 2 ~ 1.18. 11 This is not essential in this paper. When the learning rate rj is relatively large, the dynamical behavior of R is monotonic; however, when rj is small, the dynamical behavior of R becomes nonmonotonic. That is, \R\ asymptotically approaches a steady value after reaching its maximal one. The steady value is not dependent upon rj. 1.4 1.2 1 0.8 0.6 0.4 0.2 2 3 t Fig. 2. Dynamical behaviors of I and R in the linear case, rj = 1.0, a 2 B = 0.3, <t 2 j = 0.5, Figure 4 (left) presents some examples of the dynamical behaviors of q of a linear model obtained analytically, and the corresponding simulation results, where N = 2, 000. For a linear model, q increases monotonically when t increases, and increases monotonically when t' — t decreases. Furthermore, (?*'*' converges to a smaller value than unity in the case of t' — t 7^ 0.0. This means the student itself continues to move after the order parameters reach a steady state. Figure 4 (right) shows the relationship between t\ and e g in the case of constant ti — t\ and K = 2. Here, e g increases monotonically, remains constant, or decreases monotonically depending on the values of rj. 7/11 J. Phys. Soc. Jpn. Full Paper 0.5 -0.5 10 20 l(Eta=1.0) 1 (Eta=0.5) 1 (Eta=0.2) R(Eta=1.0) R (Eta=0.5) R (Eta=0.2) 30 t 40 50 60 Fig. 3. Dynamical behaviors of I and R in the nonlinear case, a = 2.0. The behaviors of e g when the leading time t\ — > oo and the time interval tk+\ — tk — ► oo can be theoretically obtained in the case of a linear model as follows: 6 e g decreases as rj decreases regardless of K. When the weights are uniform or Cy. = C = 1/K and K = oo, B = limx-»oo ^ J2k=i J tk ■ This m eans the generalization error equals the residual error caused by teacher's noise ns- On the other hand, the generalization error e g of K = oo is \ times of that of K = 1 when the learning rate satisfies rj = 1, the uniform weights Ct = 1/K, a 2 B = a 2 j, t\ — > 00, and — — > 00. Since the generalizaion error e 9 of the conventional space-domain ensemble learning with K = 00, rj = 1, C). = l/K and <j 2 b = a 2 is \ times of that of K = l, 2 we can say the time-domain ensemble learning is twice as effective as the conventional space-domain ensemble learning. This difference can be explained as follows: In conventional space-domain ensemble learning, the similarities among students become high since all students use the same examples for learning. In time-domain ensemble learning, on the other hand, the similarities among brothers become low since all brothers use almost totally different examples for learning. Figures 5-7 (left) show some examples of the dynamical behaviors of q for a nonlinear model obtained numerically, and the corresponding simulation results, where iV = 2,000. These figures indicate that q for a nonlinear model behaves nonmonotonically for t when 77 is small. Figures 5-7 (right) show the relationship between t\ and e g in the case of constant t 2 — 1\ and K = 2. The steady value of e g is dependent upon i 2 — 1\ and is not dependent upon rj. However, when rj is small, e g behaves nonmonotonically for t\ and has the minimal value shown in Figs. 6 (right) and 7 (right). This phenomenon can be considered a kind of over- learning. Figure 7 (right) shows that the minimal value of e g decreases when t 2 — t\ increases as — > 5 — > 10 and increases when ti — 1\ increases as 10 — > 20. This means that t% — 1\ has an optimum value. Figures 6 (right) and 7 (right) reveal that the smaller the learning rate rj is, 8/11 J. Phys. Soc. Jpn. Full Paper t'-t=0.0 t'-t=0.5 t'-t=1.0 t'-t=1.5 t'-t=2.0 t'-t=10.0 b 1.4 1.3 1.2 1.1 1 0.9 0.8 0.7 0.6 0.5 0.4 t2-tl=0.0 t2-tl=1.0 _ V t2-tl=2.0 « \ t2-tl=5.0 ■ \ t2-tl=10.0 1 2 3 4 5 tl Fig. 4. Dynamical behaviors of q and e g in a linear case , rj 1/K. 1.0, <r| 0.3, ct} = 0.5, if = 2, C fc = l 0.8 0.6 0.4 0.2 I 10 20 30 t t'-t=0 f-t=l t'-t=2 f-t=5 40 50 60 o w 1 e o 0.5 0.45 0.4 0.35 0.3 0.25 0.2 0.15 t2-tl=0 t2-tl=5 t2-tl=10 t2-tl=20 10 20 30 tl 40 50 60 Fig. 5. Dynamical behaviors of q and e g in a nonlinear case , rj — 1.0, a = 2.0, K = 2, Ck = 1/K. the smaller the minimal value of e g is. However, if r\ is too small, the learning becomes slow. Therefore, if the aim is to decrease the generalization error e g , we should use the smallest rj that is possible from the viewpoint of learning speed, set if — t to the optimum value, and stop the learning at an adequate time step. 5. Conclusion We have analyzed the generalization performances regarding time-domain ensemble learn- ing of both a linear model and a nonlinear model. Analyzing within the framework of online learning using a statistical mechanical method, we have demonstrated the qualitatively dif- ferent behaviors between the two models. In a linear model, the dynamical behaviors of the generalization error are monotonic. We have analytically shown that time-domain ensemble learning is twice as effective as conventional ensemble learning. Furthermore, the generaliza- tion error of a nonlinear model exhibits nonmonotonic dynamical behaviors when the learning rate is small. We have numerically shown that the generalization performance can be remark- 9/11 J. Phys. Soc. Jpn. Full Paper 1 0.5 0.45 \ 0.8 8 b 0.4 \ t2-tl=0 i \ t2-tl=5 0.6 31 w \ \ t2-tl=10 - / » . a::.: | 0.35 1 •alizat 0.4 f-t=0 * - 0.3 -\ y t2-tl=20 0.2 f-t=2 t'-t=5 t'-t=l() □ " Genei 0.25 0.2 0.15 10 20 30 40 50 60 10 20 30 40 50 t tl Fig. 6. Dynamical behaviors of q and e g in a nonlinear case . r\ = 0.5, a = 2.0, K = 2, Ck = 1/K. Fig. 7. Dynamical behaviors of g and e g in a nonlinear case , rj = 0.2, a = 2.0, K = 2, Ck = 1/K. ably improved by using this phenomenon together with the divergence of students in the time domain. Acknowledgments This research was partially supported by the Ministry of Education, Culture, Sports, Sci- ence, and Technology of Japan, with Grants-in-Aid for Scientific Research 14084212, 15500151, 16500093, and 18500183. 10/11 J. Phys. Soc. Jpn. Full Paper References 1) Saad, D. (eds.) (1998) On-line Learning in Neural Networks. Cambridge: Cambridge University Press. Hara, K. & Okada, M. (2005) Ensemble learning of linear perceptron; Online learning theory. Journal of Physical Society of Japan 74:2966-2972. Miyoshi, S., Hara, K. & Okada, M. (2005) Analysis of ensemble learning using simple perceptrons based on online learning theory. Physical Review E 71:036116. Miyoshi, S. & Okada, M. (2006) Analysis of on-line learning when a moving teacher goes around a true teacher. Journal of Physical Society of Japan 75(2):024003. Miyoshi, S. & Okada, M. (2006) Statistical mechanics of online learning for ensemble teachers. Journal of Physical Society of Japan 75(4):044002. Miyoshi, S., Uezu, T. & Okada, M. (2006) Statistical mechanics of time domain ensemble learning. Journal of Physical Society of Japan 75(8):084007. Freund, Y. & Schapire, R.E. (1999) A short introduction to boosting. Journal of Japanese Society for Artificial Intelligence 14:771-780. [in Japanese, translation by Abe, N.] http://www.boosting.org/ Statistical mechanics of ensemble learning. Krogh, A. & Sollich, P. (1997) Physical Review E 55:811-825. Online learning with ensembles. Urbanczik, R. (2000) Physical Review E 62:1448-1451. Inoue, J.I. & Nishimori, H. (1997) On-line AdaTron learning of unlcarnable rules. Physical Review E 55:4544-4551. Inoue, J.I., Nishimori, H. & Kabashima, Y. (1997) A simple perceptron that learns non-monotonic rules, cond-mat/9708096. Miyoshi, S., Hara, K. & Okada, M. (2004) Analysis of ensemble learning for committee ma- chine teacher. Proc. The Seventh Workshop on Information-Based Induction Sciences: 178-185. [in Japanese] Miyoshi, S., Hara, K. & Okada, M. (2005) Analysis of ensemble learning for non-monotonic teacher. IEICE Technical Report NC2004-214:123-128. [in Japanese] Dekel, O. & Singer, Y. (2005) Data-driven online to batch conversions. In G. Tesauro, D. S. Touret- zky and T.K. Leen (eds.), Advances in Neural Information Processing Systems 18. Freund, Y. & Shapire, R.E. (1999) Large margin classification using the perceptron algorithm. Machine Learning 37(3):277-296. Cesa-Bianchi, N., Conconi, A. & Gnetile, C. (2004) On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory 5O(9):2050-2057. Cesa-Bianchi, N. & Gnetile, C. (2005) Improved risk tail bounds for on-line algorithms. In G. Tesauro, D. S. Touretzky and T.K. Leen (eds.), Advances in Neural Information Processing Systems 18. Nishimori, H. (2001) Statistical Physics of Spin Glasses and Information Processing: An Introduc- tion. Oxford: Oxford University Press. 11/11